knitr::opts_chunk$set(comment = NA)

library(janitor)
library(naniar)
library(broom); library(gt); library(patchwork)
library(haven)     ## for zapping labels
library(mosaic)    ## auto-loads mosaicData - data source
library(GGally)    ## for scatterplot matrix
library(rsample)
library(yardstick)
library(rms)       ## auto-loads Hmisc
library(easystats)
library(tidyverse)

theme_set(theme_bw())
Data from the HELP study
New Data (The HELP study)
Today’s main data set comes from the Health Evaluation and Linkage to Primary Care trial, and is stored as HELPrct in the mosaicData package.
HELP was a clinical trial of adult inpatients recruited from a detoxification unit. Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care, with the goal of linking them to primary medical care.
Key Variables for Today
| Variable | Description |
|----------|-------------|
| id | subject identifier (note: \(n\) = 453 subjects) |
| cesd | Center for Epidemiologic Studies Depression measure (scale is 0-60; higher scores indicate more depressive symptoms) |
| age | subject age (in years) |
| sex | female (n = 107) or male (n = 346) |
| subst | primary substance of abuse (alcohol, cocaine or heroin) |
The CES-D is a 20-item measure that asks people to rate how often over the past week they experienced symptoms associated with depression, such as restless sleep, poor appetite, and feeling lonely.
Each item is rated on a 0-3 scale, and then summed, so possible scores range from 0 to 60.
Higher scores indicate more symptoms (or more frequent symptoms).
A version of the CES-D scale is available here as a PDF.
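The scoring scheme above can be sketched in a few lines of base R. The item responses below are simulated for illustration, not real CES-D data:

```r
# Each of 20 items is rated 0-3; the CES-D score is their sum.
set.seed(432)
items <- sample(0:3, size = 20, replace = TRUE)  # simulated item responses
score <- sum(items)

# The possible range runs from 0 (all items 0) to 60 (all items 3).
c(score = score, minimum = 20 * 0, maximum = 20 * 3)
```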
A cutoff for CES-D: Our binary outcome
Scores of 16 or higher on the CES-D scale are sometimes taken to indicate that a person is at risk for clinical depression.
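The binary outcome used later in these slides (cesd_hi, fit with lrm()) applies this cutoff. A minimal sketch in base R; the exact construction inside help1 is not shown in these slides, and the scores below are illustrative:

```r
# Indicator of being at risk for clinical depression: CES-D score >= 16
cesd <- c(10, 15, 16, 29, 44)       # illustrative scores, not the real data
cesd_hi <- as.numeric(cesd >= 16)   # 1 = at risk, 0 = not at risk
cesd_hi                             # [1] 0 0 1 1 1
```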
p1 <- ggplot(help1, aes(sample = cesd)) +
  geom_qq(col = "slateblue") +
  geom_qq_line(col = "red") +
  theme(aspect.ratio = 1) +
  labs(y = "Sorted CES-D Scores", x = "Standard Normal Distribution")

bw = 5 # I tried a couple of things - this worked best for me with these data

p2 <- ggplot(help1, aes(x = cesd)) +
  geom_histogram(binwidth = bw, fill = "slateblue", col = "gold") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(help1$cesd), sd = sd(help1$cesd)) *
      length(help1$cesd) * bw,
    geom = "area", alpha = 0.5, fill = "thistle", col = "red") +
  labs(y = "Number of Subjects", x = "CES-D Score")

p3 <- ggplot(help1, aes(x = cesd, y = "")) +
  geom_violin(fill = "slateblue") +
  geom_boxplot(width = 0.3, col = "gold", notch = TRUE,
               outlier.color = "slateblue") +
  stat_summary(fun = "mean", geom = "point", col = "red") +
  labs(x = "CES-D Score", y = "")

p1 + (p2 / p3 + plot_layout(heights = c(4, 1))) +
  plot_annotation(
    title = "CES-D Depression Scores from help1 data",
    subtitle = "Higher CES-D scores indicate more severe depressive symptoms",
    caption = "n = 453, no missing data")
Quantitative Outcome (CES-D)
Describing our outcome, CES-D (1/2)
describe(help1$cesd) ## describe comes from the Hmisc package
Gmd = Gini’s mean difference, a robust measure of variation. If we repeatedly selected two subjects at random, the average difference in their cesd scores would be 14.23 points.
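Gini’s mean difference is just the average absolute difference across all pairs of distinct subjects. A base-R sketch of the computation (Hmisc’s GiniMd() does this efficiently):

```r
# Gini's mean difference: mean |x_i - x_j| over all ordered pairs with i != j
gini_md <- function(x) {
  n <- length(x)
  d <- abs(outer(x, x, "-"))  # n x n matrix of pairwise absolute differences
  sum(d) / (n * (n - 1))      # average over the n(n-1) ordered pairs
}

gini_md(c(1, 2, 3))  # pairwise gaps 1, 2, 1 -> mean 4/3
```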
temp <- help1 |> select(age, mcs, pcs, pss_fr, sex, subst, cesd)
ggpairs(temp) ## ggpairs from the GGally package
We place the outcome (cesd) last (result on next slide.)
Saving the Data Set
write_rds(help1, "c05/data/help1.Rds")
Scatterplot Matrix (result)
Using ols() to fit a linear regression model
Fitting using ols()
The ols function stands for ordinary least squares and comes from the rms package, by Frank Harrell and colleagues. Any model fit with lm can also be fit with ols.
To predict var_y using var_x from the my_tibble data, we would use the following syntax:
dd <- datadist(my_tibble)
options(datadist = "dd")

model_name <- ols(var_y ~ var_x, data = my_tibble,
                  x = TRUE, y = TRUE)
This leaves a few questions…
What’s the datadist stuff doing?
Before fitting an ols model to data from my_tibble, use:
dd <- datadist(my_tibble)
options(datadist = "dd")
Run (the datadist code above) once before any models are fitted, storing the distribution summaries for all potential variables. Adjustment values are 0 for binary variables, the most frequent category (or optionally the first category level) for categorical (factor) variables, the middle level for ordered factor variables, and medians for continuous variables. (excerpt from datadist documentation)
Why use x = TRUE, y = TRUE?
Once we’ve set up the summaries with datadist, we fit a model:
model_name <- ols(var_y ~ var_x, data = my_tibble,
                  x = TRUE, y = TRUE)
ols stores additional information beyond what lm does
x = TRUE and y = TRUE save even more expanded information for building plots and summarizing fit.
The defaults are x = FALSE, y = FALSE, but in 432, we’ll want them saved.
Using ols to fit a model
Let’s try to predict our outcome (cesd) using mcs and subst
Start with setting up the datadist
Then fit the model, including x = TRUE, y = TRUE
dd <- datadist(help1)
options(datadist = "dd")

mod1 <- ols(cesd ~ mcs + subst, data = help1,
            x = TRUE, y = TRUE)
Contents of mod1?
mod1
Linear Regression Model
ols(formula = cesd ~ mcs + subst, data = help1, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 295.10 R2 0.479
sigma 9.0657 d.f. 3 R2 adj 0.475
d.f. 449 Pr(> chi2) 0.0000 g 9.827
Residuals
Min 1Q Median 3Q Max
-25.43696 -6.74592 0.09334 6.16212 24.24842
Coef S.E. t Pr(>|t|)
Intercept 55.3026 1.2724 43.46 <0.0001
mcs -0.6570 0.0337 -19.48 <0.0001
subst=cocaine -3.4440 1.0055 -3.43 0.0007
subst=heroin -1.7791 1.0681 -1.67 0.0965
New elements in ols
For our mod1,
Model Likelihood Ratio test output includes LR chi2 = 295.10, d.f. = 3, Pr(> chi2) = 0.0000
The log of the likelihood ratio, multiplied by -2, yields a test against a \(\chi^2\) distribution. Interpret this as a goodness-of-fit test that compares mod1 to a null model with only an intercept term. In ols this is similar to a global (ANOVA) F test.
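The same construction works for any pair of nested models fit by maximum likelihood. A sketch with a built-in data set (mtcars stands in for help1 here; this is not the mod1 computation itself):

```r
# Likelihood ratio test: compare a fitted model to the intercept-only model
full <- lm(mpg ~ wt + factor(cyl), data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# -2 log(likelihood ratio) = 2 * (logLik full - logLik null)
lr_chi2 <- 2 * (as.numeric(logLik(full)) - as.numeric(logLik(null)))
df <- length(coef(full)) - length(coef(null))  # 3 here, as in mod1
p_value <- pchisq(lr_chi2, df = df, lower.tail = FALSE)

c(lr_chi2 = lr_chi2, df = df, p = p_value)
```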
New elements in ols
Under the \(R^2\) values, we have g = 9.827.
This is the \(g\)-index, based on Gini’s mean difference. If you randomly selected two of the subjects in the model, the average difference in predicted cesd will be 9.827.
This can be compared to the Gini’s mean difference for the original cesd values, from describe, which was Gmd = 14.23.
The data used to fit the model provide an over-optimistic view of the quality of fit.
We’re interested here in assessing how well the model might work in new data, using a resampling approach.
Resampling Validation for \(R^2\)
| | index.orig | training | test | optimism | index.corrected | n |
|---|---|---|---|---|---|---|
| \(R^2\) | 0.4787 | 0.4874 | 0.4737 | 0.0137 | 0.4650 | 40 |
index.orig for \(R^2\) is 0.4787. That’s what we get from the data used to fit mod1.
With validate we create 40 (by default) bootstrapped resamples of the data and then split each of those into training and test samples.
For each of the 40 splits, R refits the model (same predictors) in the training sample to obtain \(R^2\): mean across 40 splits is 0.4874
Check each model in its test sample: average \(R^2\) was 0.4737
optimism = training result - test result = 0.0137
index.corrected = index.orig - optimism = 0.4650
Our nominal \(R^2\) is 0.4787; correcting for optimism yields a validated \(R^2\) of 0.4650, so we conclude that \(R^2\) = 0.4650 better estimates how mod1 will perform in new data.
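The arithmetic behind the optimism correction is simple enough to verify by hand, using the values from the table above:

```r
# Optimism-corrected index: original estimate minus (training - test) optimism
corrected <- function(index_orig, training, test) {
  optimism <- training - test
  index_orig - optimism
}

corrected(0.4787, 0.4874, 0.4737)  # validated R^2 = 0.4650
```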
Resampling Validation for MS(Error)
| | index.orig | training | test | optimism | index.corrected | n |
|---|---|---|---|---|---|---|
| MSE | 81.4606 | 79.7851 | 82.2361 | -2.4510 | 83.9116 | 40 |
index.orig for MSE = 81.4606. That’s what we get from the data used to fit mod1.
For each of the 40 splits, R refits the model (same predictors) in the training sample to obtain MSE: mean across 40 splits is 79.7851
Check each model in its test sample: average MSE was 82.2361
optimism = training result - test result = -2.4510
index.corrected = index.orig - optimism = 83.9116
Our nominal MSE is 81.4606 (so RMSE = \(\sqrt{81.4606} = 9.03\)); correcting for optimism yields a validated MSE of 83.9116 and a validated RMSE = \(\sqrt{83.9116} = 9.16\).
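The same correction applies here; note that the optimism for MSE is negative, so the validated MSE is larger than the nominal one. Checking with the values from the table above:

```r
# Optimism correction for MSE, then convert to RMSE
mse_orig <- 81.4606
optimism <- 79.7851 - 82.2361          # training - test = -2.4510
mse_validated <- mse_orig - optimism   # 83.9116

c(rmse_nominal = sqrt(mse_orig), rmse_validated = sqrt(mse_validated))
```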
ANOVA for mod1 fit by ols
anova(mod1)
Analysis of Variance Response: cesd
Factor d.f. Partial SS MS F P
mcs 1 31182.7237 31182.72373 379.42 <.0001
subst 2 968.7563 484.37816 5.89 0.003
REGRESSION 3 33886.8359 11295.61195 137.44 <.0001
ERROR 449 36901.6542 82.18631
This adds a line for the complete regression model (both terms), which can be helpful.
Note that, unlike anova() after a fit using lm(), anova() for an rms fit presents Wald-type partial tests (hence the Partial SS label): each term is tested after adjusting for all of the others, so including subst in the model first would give the same SS, MS, F and p for mcs and subst, along with the same REGRESSION and ERROR results.
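The order-dependence of a sequential ANOVA table from lm() can be demonstrated with a built-in data set:

```r
# Sequential ANOVA in lm: the SS for wt depends on whether it enters
# the model first or second
a1 <- anova(lm(mpg ~ wt + hp, data = mtcars))
a2 <- anova(lm(mpg ~ hp + wt, data = mtcars))

c(wt_entered_first = a1["wt", "Sum Sq"],
  wt_entered_second = a2["wt", "Sum Sq"])
```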
Effect of mcs: -12.66 is the estimated change in cesd associated with a move from mcs = 21.68 (see Low value) to mcs = 40.94 (the High value) assuming no change in subst.
ols chooses the Low and High values from the interquartile range.
quantile(help1$mcs, c(0.25, 0.75))
25% 75%
21.67575 40.94134
Plot the summary to see effect sizes
Goal: plot effect sizes for similar moves within predictor distributions.
plot(summary(mod1))
The triangles indicate the point estimate, augmented with confidence interval bars.
The 90% confidence intervals are plotted with the thickest bars.
The 95% CIs are then shown with thinner, more transparent bars.
Finally, the 99% CIs are shown as the longest, thinnest bars.
At left, impact of changing mcs on cesd holding subst at its baseline (alcohol).
At right, impact of changing subst on cesd holding mcs at its median (28.602417).
Defaults: 95% CI bands are added, and the layout aims for a square shape.
Build a nomogram for the ols fit
plot(nomogram(mod1))
Nomograms
For complex models (this model isn’t actually very complex) it can be helpful to have a tool that will help you see the modeled effects in terms of their impact on the predicted outcome.
A nomogram is an established graphical tool for doing this.
Find the value of each predictor on its provided line, and identify the “points” for that predictor by drawing a vertical line up to the “Points”.
Then sum up the points over all predictors to obtain “Total Points”.
Draw a vertical line down from the “Total Points” to the “Linear Predictor” to get the predicted cesd for this subject.
Using the nomogram for the mod1 fit
Predicted cesd if mcs = 35 and subst = heroin?
Actual Prediction for this subject…
The predict function for our ols fit provides fitted values.
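We can also check the nomogram reading by hand from the mod1 coefficients shown earlier. This is a sketch of the arithmetic only; predict(mod1, newdata = ...) is the real tool:

```r
# Predicted cesd from mod1's coefficients, for mcs = 35 and subst = heroin
b_int    <- 55.3026   # intercept
b_mcs    <- -0.6570   # per point of mcs
b_heroin <- -1.7791   # subst = heroin vs. alcohol (the reference level)

b_int + b_mcs * 35 + b_heroin  # predicted CES-D, about 30.5
```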
The lrm() function stands for logistic regression model and also comes from the rms package. Let’s predict our binary outcome (cesd_hi) using mcs and subst.
Start by setting up the datadist, then fit the model, including x = TRUE, y = TRUE.
dd <- datadist(help1)
options(datadist = "dd")

mod2 <- lrm(cesd_hi ~ mcs + subst, data = help1,
            x = TRUE, y = TRUE)
Contents of mod2?
mod2
Logistic Regression Model
lrm(formula = cesd_hi ~ mcs + subst, data = help1, x = TRUE,
y = TRUE)
Model Likelihood Discrimination Rank Discrim.
Ratio Test Indexes Indexes
Obs 453 LR chi2 134.24 R2 0.533 C 0.938
0 46 d.f. 3 R2(3,453) 0.252 Dxy 0.875
1 407 Pr(> chi2) <0.0001 R2(3,124) 0.653 gamma 0.875
max |deriv| 6e-06 Brier 0.056 tau-a 0.160
Coef S.E. Wald Z Pr(>|Z|)
Intercept 10.5778 1.2429 8.51 <0.0001
mcs -0.1796 0.0235 -7.64 <0.0001
subst=cocaine -1.5025 0.4811 -3.12 0.0018
subst=heroin -1.2695 0.5979 -2.12 0.0337
New elements in lrm
For our mod2,
Model Likelihood Ratio test output includes LR chi2 = 134.24, d.f. = 3, Pr(> chi2) <0.0001
Again, the log of the likelihood ratio, multiplied by -2, yields a test against a \(\chi^2\) distribution. Interpret this as a goodness-of-fit test that compares mod2 to a null model with only an intercept term.
The R2 value is the Nagelkerke \(R^2\), another pseudo-\(R^2\) measure, which rescales the Cox-Snell \(R^2\) so that its maximum value is 1.
The Brier score is the mean squared error between predictions and actual (1/0) observations. The lower the score (closer to 0), the better the model’s predictions are calibrated. It’s not really useful on its own, but helps when comparing models.
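The Brier score computation is easy to sketch in base R (toy outcomes and predicted probabilities here, not mod2’s):

```r
# Brier score: mean squared difference between predicted probability
# and the observed (1/0) outcome
brier <- function(prob, y) mean((prob - y)^2)

y    <- c(1, 1, 0, 0)            # observed binary outcomes
prob <- c(0.9, 0.7, 0.2, 0.4)    # hypothetical predicted probabilities

brier(prob, y)  # closer to 0 means better-calibrated predictions
```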
Dxy = Somers’ d, and the area under the ROC curve is C = 0.5 + (Dxy/2)
Our original Dxy = 0.8751, implying C = 0.9376
Our validated Dxy = 0.8634, so validated C = 0.5 + (0.8634/2) = 0.9317
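The C statistic is the proportion of (outcome = 1, outcome = 0) pairs in which the subject with the outcome received the higher prediction, and Dxy = 2C - 1, which rearranges to C = 0.5 + (Dxy/2). A base-R sketch with toy data:

```r
# C statistic: fraction of (1, 0) pairs where the "1" subject gets the
# higher predicted probability; ties count as half
c_stat <- function(prob, y) {
  p1 <- prob[y == 1]
  p0 <- prob[y == 0]
  pairs <- outer(p1, p0, ">") + 0.5 * outer(p1, p0, "==")
  mean(pairs)
}

y    <- c(0, 0, 1, 1)
prob <- c(0.1, 0.4, 0.3, 0.9)

C   <- c_stat(prob, y)  # 3 of 4 pairs concordant -> 0.75
Dxy <- 2 * C - 1        # Somers' d = 0.5
```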
Our nominal \(R^2\) is 0.5326; correcting for optimism yields a validated \(R^2\) of 0.5152.
ANOVA for mod2 fit by lrm
anova(mod2)
Wald Statistics Response: cesd_hi
Factor Chi-Square d.f. P
mcs 58.43 1 <.0001
subst 10.04 2 0.0066
TOTAL 62.30 3 <.0001
Note that these are Wald (partial) tests: each term is tested after adjusting for the other, so including subst in the model first would give the same Chi-Square and p values for mcs and subst, and the same TOTAL result.
summary for mod2 fit by lrm
summary(mod2, conf.int =0.90)
             Effects              Response : cesd_hi

 Factor                  Low    High   Diff.  Effect    S.E.    Lower 0.9 Upper 0.9
 mcs                     21.676 40.941 19.266 -3.460400 0.45270 -4.20500  -2.715800
  Odds Ratio             21.676 40.941 19.266  0.031417      NA  0.01492   0.066152
 subst - cocaine:alcohol  1.000  2.000     NA -1.502500 0.48114 -2.29390  -0.711070
  Odds Ratio              1.000  2.000     NA  0.222580      NA  0.10087   0.491120
 subst - heroin:alcohol   1.000  3.000     NA -1.269500 0.59788 -2.25290  -0.286070
  Odds Ratio              1.000  3.000     NA  0.280980      NA  0.10509   0.751210
summary for mod2 fit by lrm
Factor Low High Diff. Effect S.E. Lower 0.9 Upper 0.9
mcs 21.676 40.941 19.266 -3.46040 0.4527 -4.2050 -2.71580
Odds Ratio 21.676 40.941 19.266 0.03142 NA 0.0149 0.06615
Odds of cesd_hi are 0.03 times as high for a subject with mcs = 40.94 (High) as compared to a subject with mcs = 21.68 (Low) assuming no change in subst.
As with ols, lrm chooses the Low and High values from the interquartile range.
summary for mod2 fit by lrm
Factor Low High Diff Effect S.E. Lower 0.9 Upper 0.9
subst - cocaine:alcohol 1 2 NA -1.5025 0.4811 -2.2939 -0.71107
Odds Ratio 1 2 NA 0.2226 NA 0.1009 0.49112
subst - heroin:alcohol 1 3 NA -1.2695 0.5979 -2.2529 -0.28607
Odds Ratio 1 3 NA 0.2810 NA 0.1051 0.75121
Effect of subst being cocaine instead of alcohol on cesd_hi is an Odds Ratio of 0.22 (0.10, 0.49), assuming no change in mcs.
Effect of subst being heroin instead of alcohol on cesd_hi is an Odds Ratio of 0.28 (0.11, 0.75), assuming no change in mcs.
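These odds ratios are just the exponentiated lrm coefficients (for mcs, the coefficient times the 19.266-point IQR move). Checking against the coefficients shown earlier:

```r
# Odds ratios from mod2's log-odds coefficients
exp(-1.5025)            # subst = cocaine vs alcohol, about 0.22
exp(-1.2695)            # subst = heroin vs alcohol, about 0.28
exp(-0.1796 * 19.266)   # mcs: High (40.94) vs Low (21.68), about 0.03
```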
Plot the summary to see effect sizes
Goal: plot effect sizes for similar moves within predictor distributions.